Principal component analysis (PCA) is often performed during exploratory data analysis when there are many variables and/or multicollinearity is an issue. It is an orthogonal transformation that simply computes a new set of axes from which to observe the data. The first axis (or principal component) points in the direction of greatest variability, the second principal component points in the direction of second-greatest variability, and so on. What PCA gives you in terms of dimensionality reduction and independence, you lose in interpretability (since every principal component is a linear combination of every original variable).
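To make those two properties concrete, here is a minimal sketch on simulated data (hypothetical, not the weather data): `prcomp()` returns components ordered by decreasing variance, and the rotated scores are mutually uncorrelated.

```r
# Toy data: a and b are strongly correlated, c is independent noise
set.seed(42)
x  <- rnorm(200)
df <- data.frame(a = x, b = x + rnorm(200, sd = 0.3), c = rnorm(200))

pca <- prcomp(df, center = TRUE, scale. = TRUE)
pca$sdev              # standard deviations, largest first
round(cor(pca$x), 3)  # scores are (numerically) uncorrelated
```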
In the previous modeling section we saw clear evidence of multicollinearity and decided to remove 4 variables because of it. Here, we’ll use PCA to eliminate the multicollinearity issue while retaining as much of the information as possible.
We’ll begin by loading the data and cleaning it up for the analysis:
# Load libraries
library(ggbiplot)
library(ggplot2)
library(plotly)
library(rsm)
library(spatial)
library(car)
# Load data and format the months factor for nice plotting
weather <- read.csv("weather_df.csv")
weather$Month = factor(weather$Month, levels(weather$Month)[c(5, 4, 8, 1, 9,
7, 6, 2, 12, 11,
10, 3)])
head(weather)
# Remove samples where data is missing from one or more variables
weather.complete <- weather[complete.cases(weather),]
# Generate principal components
weather.pca <- prcomp(weather.complete[,c(5, 9:15)],
center = TRUE,
scale. = TRUE)
# View the principal components
summary(weather.pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 2.3525 1.0328 0.77055 0.70533 0.46621 0.23583
## Proportion of Variance 0.6918 0.1333 0.07422 0.06219 0.02717 0.00695
## Cumulative Proportion 0.6918 0.8251 0.89932 0.96151 0.98868 0.99563
## PC7 PC8
## Standard deviation 0.16444 0.08904
## Proportion of Variance 0.00338 0.00099
## Cumulative Proportion 0.99901 1.00000
After building the principal components for the data set, the summary tells us that PC1 contains 69% of the total variability, PC2 contains 13%, and almost all of the variability is accounted for by PC1 through PC4. Interestingly, 82.5% of the total variability is accounted for by the first two principal components alone. Because most of the variation lies in the first two principal components, we can move forward with PC1 and PC2 for further analysis.
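If you want these proportions programmatically (for instance, to choose how many components to keep), they can be computed directly from the standard deviations returned by `prcomp()`. A sketch, continuing with the `weather.pca` object from above:

```r
# Proportion of variance explained is sdev^2 normalized by the total
# variance; cumsum() gives the running (cumulative) total.
prop.var <- weather.pca$sdev^2 / sum(weather.pca$sdev^2)
cum.var  <- cumsum(prop.var)

# e.g., smallest number of components covering 95% of total variance:
which(cum.var >= 0.95)[1]
```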
One useful chart for visualizing principal components is a biplot. In this type of chart, the original axes are projected onto a scatter plot that uses the principal components as the primary axes. This lets you see how the principal components relate to the original variables.
ggbiplot(weather.pca, alpha = .25, ellipse = TRUE, groups = weather.complete$Month)
In this biplot it is evident that Solar Radiation plays a major role in determining PC1 and, therefore, accounts for much of the variability in the dataset. Furthermore, we can see how the variance increases in the winter months (pink) and decreases in the summer months (green/blue).
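The biplot’s arrows come from the loadings matrix, which you can also inspect directly to verify which original variables dominate each component. A sketch, again assuming the fitted `weather.pca` object from above:

```r
# The rotation matrix holds the loadings: each column expresses one
# principal component as a linear combination of the original variables.
round(weather.pca$rotation[, 1:2], 2)

# Original variables ranked by the size of their loading on PC1:
sort(abs(weather.pca$rotation[, "PC1"]), decreasing = TRUE)
```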
Now, we can look at the relationship between the principal components and the total precipitation by making scatter plots.
PCs <- weather.pca$x
PCs <- as.data.frame(PCs)
PC.complete <- cbind(PCs,weather.complete$Total_Precip, weather.complete$Month)
colnames(PC.complete)[9:10] <- c("Total_Precip", "Month")
ggplot(data = PC.complete, mapping = aes(x = PC1, y = Total_Precip)) +
geom_point(mapping = aes(color = Month))
ggplot(data = PC.complete, mapping = aes(x = PC2, y = Total_Precip)) +
geom_point(mapping = aes(color = Month))
ggplot(data = PC.complete, mapping = aes(x = PC3, y = Total_Precip)) +
geom_point(mapping = aes(color = Month))
plot_ly(PC.complete, x = ~PC1,
y = ~PC2,
z = ~Total_Precip,
type = "scatter3d",
color = ~Month,
mode = "markers",
marker = list(line = list(color = "black", width = .75)))
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
Evidently, precipitation events are favored by low values of PC1 (the component related to solar radiation and temperature) and high values of PC2. I say "favored" because there are clearly many low-PC1, high-PC2 days on which no precipitation occurred.
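One way to quantify "favored" (a sketch, continuing the session, and not part of the original analysis) is to compare the fraction of wet days across the sign combinations of PC1 and PC2:

```r
# Fraction of days with any precipitation (Total_Precip > 0)
# within each PC1/PC2 sign quadrant
with(PC.complete,
     tapply(Total_Precip > 0,
            list(PC1.positive = PC1 > 0, PC2.positive = PC2 > 0),
            mean))
```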
Having analyzed the principal components and seen a clear response between the first two principal components and total precipitation, we can build a new regression model based on the principal components. I’m moving forward with a fourth-degree polynomial model arrived at by trial and error, using the fit diagnostics from the previous section as a guide.
# Note: inside a formula, PC1^2 without I() is interpreted as a
# crossing of PC1 with itself and collapses to PC1, which is why no
# pure power terms of PC1 appear in the fitted model below.
fit <- lm(Total_Precip ~ PC1 + PC1^2 + PC1^3 + PC1^4 +
            PC1:PC2 + I(PC1^2):PC2 + #I(PC1^3):PC2 +
            PC1:I(PC2^2) + I(PC1^2):I(PC2^2) +
            I(PC2^3) + PC1:I(PC2^3) +
            I(PC2^4), data = PC.complete)
summary(fit)
##
## Call:
## lm(formula = Total_Precip ~ PC1 + PC1^2 + PC1^3 + PC1^4 + PC1:PC2 +
## I(PC1^2):PC2 + PC1:I(PC2^2) + I(PC1^2):I(PC2^2) + I(PC2^3) +
## PC1:I(PC2^3) + I(PC2^4), data = PC.complete)
##
## Residuals:
## Min 1Q Median 3Q Max
## -40.701 -1.207 -0.364 0.288 139.264
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.19168 0.14530 8.202 3.34e-16 ***
## PC1 -0.50659 0.06778 -7.474 9.86e-14 ***
## I(PC2^3) 0.34575 0.05381 6.426 1.50e-10 ***
## I(PC2^4) 0.07823 0.01983 3.945 8.15e-05 ***
## PC1:PC2 -0.67140 0.10348 -6.488 9.96e-11 ***
## PC2:I(PC1^2) 0.30590 0.02315 13.215 < 2e-16 ***
## PC1:I(PC2^2) -0.69515 0.05404 -12.863 < 2e-16 ***
## I(PC1^2):I(PC2^2) 0.21304 0.01782 11.957 < 2e-16 ***
## PC1:I(PC2^3) -0.22581 0.03425 -6.594 4.97e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.162 on 3346 degrees of freedom
## Multiple R-squared: 0.3768, Adjusted R-squared: 0.3753
## F-statistic: 252.9 on 8 and 3346 DF, p-value: < 2.2e-16
persp(fit, PC1 ~ PC2, zlab = "Total_Precip")
Though all of the model’s regression coefficients are significant, the adjusted R-squared value is still relatively low, indicating poor predictive ability. Now we can run the model diagnostics to see whether principal component regression improved our model at all.
all_vifs <- car::vif(fit)
print(all_vifs)
## PC1 I(PC2^3) I(PC2^4) PC1:PC2
## 1.662429 3.129844 2.893834 3.088225
## PC2:I(PC1^2) PC1:I(PC2^2) I(PC1^2):I(PC2^2) PC1:I(PC2^3)
## 1.445386 1.912445 1.294419 3.058461
Notice that, as indicated by the variance inflation factors above, we no longer have multicollinearity issues.
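The low VIFs on the main-effect terms are by construction: principal component scores are mutually orthogonal, so their pairwise correlations are (numerically) zero; the somewhat larger VIFs here belong to the polynomial and interaction terms, which are correlated with one another. A quick check, continuing with the `PCs` data frame from above:

```r
# Pairwise correlations of the principal component scores are ~0,
# so regressors built from single PCs cannot be collinear.
round(cor(PCs[, 1:4]), 3)
```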
# qq plot for studentized residuals
qqPlot(fit, main="QQ Plot")
## 1789 3712
## 1717 3299
The Q-Q plot indicates that we are still not satisfying the normality assumption.
# Breusch-Pagan test
ncvTest(fit)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 11732.51, Df = 1, p = < 2.22e-16
# plot studentized residuals vs. fitted values
spreadLevelPlot(fit)
## Warning in spreadLevelPlot.lm(fit):
## 882 negative fitted values removed
## Warning in rlm.default(x, y, weights, method = method, wt.method =
## wt.method, : 'rlm' failed to converge in 20 steps
##
## Suggested power transformation: 0.002351599
Again, the variances are not constant.
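The suggested power transformation near zero corresponds to a log transform of the response. As a sketch of one possible remedy (not part of the original analysis, and with an intentionally simplified formula), we could refit using `log1p(Total_Precip)`, which compresses the heavy right tail while remaining defined on zero-precipitation days:

```r
# log1p(y) = log(1 + y) is defined at y = 0, unlike log(y)
fit.log <- lm(log1p(Total_Precip) ~ PC1 + PC2, data = PC.complete)
summary(fit.log)
```

Whether this helps the heteroscedasticity would need to be confirmed by rerunning `ncvTest()` and `spreadLevelPlot()` on the new fit.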
The principal component analysis allowed us to observe some interesting trends in the weather data, including the importance of solar radiation and the increased variance in the winter. Additionally, the dimensionality reduction allowed us to build a visually appealing model, although principal component regression did not improve the performance of our model.